Two methods to reduce the CPU time needed for the numerical evaluation of 
cross-sections and similar quantities are discussed. 



1 Introduction 

The numerical evaluation of cross-sections and similar quantities can be
expensive in terms of CPU time, in particular if radiative corrections are
included and/or if the model has parameters that need to be scanned over.
This contribution discusses new and improved algorithms for phase-space
integration and the parallelization of parameter scans to speed up the
calculation. Both methods have been implemented successfully in version 4
of the package FormCalc.

2 Phase-space integration 

The recently completed Cuba library [1] provides four subroutines for
multidimensional numerical integration. All four have a very similar invocation
and can thus be interchanged easily, e.g. for comparison. The flexibility of a 
general-purpose method is particularly useful in the setting of automatically 
generated code. The following algorithms are contained in the Cuba library: 

Vegas is the classic Monte Carlo algorithm which uses importance sam- 
pling for variance reduction. It iteratively builds up a piecewise constant weight 
function, represented on a rectangular grid. Each iteration consists of a sam- 
pling step followed by a refinement of the grid. The present implementation 
uses Sobol quasi-random numbers for sampling. 

Suave is a crossover between Vegas and Miser and combines Vegas-style 
importance sampling with globally adaptive subdivision: Until the requested 
accuracy is reached, the region with the largest error is bisected along the axis 
in which the fluctuations of the integrand are reduced most. In each half the 
number of new samples is prorated for the fluctuation. 

Figure 1: Characteristics of the Cuba routines for the phase-space integration
of $e^+e^- \to t\bar{t}\gamma$ at a requested relative accuracy of
$3 \times 10^{-3}$.

Divonne is a further development of the CernLib routine D151. It is
intrinsically a Monte Carlo algorithm but also has cubature rules built in
for comparison. The variance-reduction method is stratified sampling. In a
first step, a tessellation of the integration region is constructed in which
all subregions have an approximately equal value of the spread, defined as

    s(r) = \frac{1}{2} \mathrm{Vol}(r) \left( \max_{x \in r} f(x) - \min_{x \in r} f(x) \right)    (1)

Minimum and maximum here are sought using methods from numerical
optimization. The subregions are then sampled independently with a number
of points extrapolated to reach the required accuracy. For each region, the
value obtained in this second sampling is compared with the initial rough
estimate, and if the two are not compatible within their errors, the region
is subdivided or sampled once more. Additions to CernLib's D151 are the final
comparison phase and the possibility to point out known extrema to speed up
convergence.
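
As an illustration of the spread in Eq. (1), the following sketch evaluates it for a rectangular region, assuming the definition s(r) = 1/2 Vol(r) (max f - min f). The function name `spread` and its parameters are hypothetical. Unlike Divonne, which locates the extrema by numerical optimization, this sketch merely probes the region with random points, so it can only underestimate the true spread.

```python
import random

def spread(f, lower, upper, n_probe=200, seed=0):
    """Approximate s(r) = 1/2 Vol(r) (max f - min f) on the box [lower, upper]."""
    rng = random.Random(seed)
    volume = 1.0
    for lo, hi in zip(lower, upper):
        volume *= hi - lo
    # Crude stand-in for the optimizer: take the extrema over random probe points.
    values = []
    for _ in range(n_probe):
        x = [lo + (hi - lo) * rng.random() for lo, hi in zip(lower, upper)]
        values.append(f(x))
    return 0.5 * volume * (max(values) - min(values))
```

A tessellation in the sense of the text would keep splitting the subregion with the largest spread until all spreads are roughly equal. A constant integrand has zero spread and would therefore never be split.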

Cuhre is a new implementation of DCUHRE. It is a deterministic algorithm 
which employs cubature rules of a polynomial degree. Variance reduction is 
by globally adaptive subdivision: Until the requested accuracy is reached, 
bisect the region with the largest error along the axis with the largest fourth 
difference. 
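
The globally adaptive strategy can be sketched as follows. This is not the Cuhre rule set: for brevity, a midpoint rule and a degree-3 rule (center plus face centers) stand in for the actual cubature rules, and their difference serves as a deliberately crude error estimate. The names `evaluate` and `integrate` and all parameters are hypothetical.

```python
import heapq

def evaluate(f, lo, hi):
    """Return (integral, error estimate, split axis) for the box [lo, hi]."""
    dim = len(lo)
    center = [(a + b) / 2 for a, b in zip(lo, hi)]
    half = [(b - a) / 2 for a, b in zip(lo, hi)]
    volume = 1.0
    for h in half:
        volume *= 2 * h
    f0 = f(center)

    def shifted(axis, t):
        x = list(center)
        x[axis] += t * half[axis]
        return f(x)

    face_sum = 0.0
    best_axis, best_diff = 0, -1.0
    for axis in range(dim):
        outer = shifted(axis, -1.0) + shifted(axis, 1.0)
        inner = shifted(axis, -0.5) + shifted(axis, 0.5)
        face_sum += outer
        diff = abs(outer - 4 * inner + 6 * f0)  # fourth difference along axis
        if diff > best_diff:
            best_axis, best_diff = axis, diff
    low = volume * f0                                    # midpoint rule
    high = volume * ((1 - dim / 3) * f0 + face_sum / 6)  # degree-3 rule
    return high, abs(high - low), best_axis

def integrate(f, lo, hi, rel_tol=1e-3, max_regions=5000):
    """Bisect the region with the largest error until the tolerance is met."""
    value, error, axis = evaluate(f, lo, hi)
    heap = [(-error, 0, value, axis, list(lo), list(hi))]  # max-heap on error
    counter = 1
    total, total_err = value, error
    while total_err > rel_tol * abs(total) and len(heap) < max_regions:
        neg_err, _, value, axis, rlo, rhi = heapq.heappop(heap)
        total -= value
        total_err += neg_err
        mid = (rlo[axis] + rhi[axis]) / 2
        left_hi = rhi[:axis] + [mid] + rhi[axis + 1:]
        right_lo = rlo[:axis] + [mid] + rlo[axis + 1:]
        for new_lo, new_hi in ((rlo, left_hi), (right_lo, rhi)):
            v, e, a = evaluate(f, new_lo, new_hi)
            heapq.heappush(heap, (-e, counter, v, a, new_lo, new_hi))
            counter += 1
            total += v
            total_err += e
    return total, total_err
```

The heap always holds the region with the largest error estimate on top, which is what makes the subdivision global rather than local. The running counter in each heap entry only breaks ties between regions with equal errors.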

Fig. 1 compares the performance of the four algorithms for a real
phase-space integration of the process $e^+e^- \to t\bar{t}\gamma$. Above all,
it is very important to have several independent integration methods to
cross-check the results.

3 Parallelization 

Calculations in models like the MSSM, where not all input parameters are 
yet known, often require extensive scans to cover an interesting part of the 
parameter space. Such a scan can be a real CPU hog, but on the other hand, 
the calculation can be performed completely independently for each parameter 
set and is thus an ideal candidate for parallelization. The real question is thus 
not how to parallelize the calculation, but how to automate the parallelization. 

The method is quite general, but consider FormCalc for a specific instance. 
The user may specify parameter loops by defining preprocessor variables, e.g. 

    #define LOOP1 do 1 TB = 2, 30

These definitions are substituted at compile time into a main loop (see below).
The obstacle to automatic parallelization is that the loops are user-defined and 
in general nested. A serial number is introduced to unroll the loops: 



serial version:

        LOOP1
        LOOP2
        ...
        calculate cross-section
    1   continue

parallel version:

        serial = 0
        LOOP1
        LOOP2
        ...
        serial = serial + 1
        if( serial not in allowed range ) goto 1
        calculate cross-section
    1   continue




The serial number range can be specified on the command line, so it is
straightforward to distribute patches of serial numbers among different
machines. This is most easily done in an interleaved manner, since one then
does not need to know the upper limit to which the serial number runs: if
there are N machines available, send serial numbers 1, N + 1, 2N + 1, etc. to
machine 1, serial numbers 2, N + 2, 2N + 2, etc. to machine 2, and so on.
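
The interleaved distribution can be sketched in a few lines. This is an illustration only, not FormCalc code: the function names are hypothetical, and the nested parameter loops are modelled as a Cartesian product whose points are numbered from 1 in loop order.

```python
from itertools import product

def scan_points(loops):
    """Yield (serial, values) for the nested loops, numbering points from 1."""
    for serial, values in enumerate(product(*loops), start=1):
        yield serial, values

def points_for_machine(loops, machine, n_machines):
    """Select the points with serial number machine, machine + N, machine + 2N, ..."""
    return [values for serial, values in scan_points(loops)
            if (serial - machine) % n_machines == 0]
```

With `loops = [range(2, 5), [500, 1000]]` and three machines, machine 1 receives serial numbers 1 and 4, i.e. the points (2, 500) and (3, 1000). No machine needs to know that the scan contains six points in total.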

This procedure is completely automated in FormCalc: The user creates, once,
a .submitrc file in the home directory listing all machines that may be
used, one per line. For multi-processor machines, the number of processors
follows the host name. The executable compiled from FormCalc code, typically
called run, is then simply prefixed with submit. For instance, instead of
"run uuuu 500,1000" the user invokes "submit run uuuu 500,1000". The submit
script uses ruptime to determine the load of the machines and ssh to log in.
Handling of the serial number is invisible to the user.
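
To make the host list concrete, the sketch below turns the contents of such a file into one job slot per processor. It is not the actual submit script: the function name `parse_submitrc` is hypothetical, whitespace is assumed to separate host name and processor count, and the load check via ruptime and the login via ssh are left out.

```python
def parse_submitrc(text):
    """Return one slot per processor from lines of the form 'host [nproc]'."""
    slots = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        host = fields[0]
        n_proc = int(fields[1]) if len(fields) > 1 else 1
        slots.extend([host] * n_proc)
    return slots
```

With N slots in the resulting list, slot k would be handed the serial numbers k, k + N, k + 2N, etc., in line with the interleaved scheme described in the text.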